Getting Started: R and Google BigQuery

Please use a computer and network that will allow you to use Google products (Gmail, Google Drive, etc.). Some workplaces do not allow this on their devices!

If you don’t already have a personal (not work) Google Cloud Platform account, you will set one up in the first section of today’s course. You will need:

  • a mobile phone number and
  • a credit card (you will not be charged, I PROMISE)

If you’re an established GCP user, you might be past your “free tier” usage this month and may accrue a small charge for today’s work, but it’s unlikely.

BigQuery, According to Google

“BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence. BigQuery’s serverless architecture lets you use SQL queries to answer your organization’s biggest questions with zero infrastructure management.”

This slide deck was built in Quarto!

  • Use keyboard arrow keys to
    • advance ( → ) and
    • go back ( ← )
  • Type “s” to see speaker notes
  • Type “?” to see other keyboard shortcuts

About Your Presenter

Joy Payton (she/her)

Data Scientist / Data Educator

https://www.linkedin.com/in/joypayton/

I have no conflicts of interest to report.

Materials for Today

Please grab these links and use them!

Slide shows:

(Will give you links to all three hours of slide decks)

GitHub repository: https://github.com/pm0kjp/rmedicine_2024_bigquery/

Our Itinerary

Hour 1: GCP and BigQuery Orientation

Hour 2: BigQuery Data, SQL in BigQuery, Gemini

Hour 3: R/RStudio and BigQuery Integration

Itinerary for First Hour

  • Getting started in GCP
  • Creating a new Project
  • Enabling BigQuery
  • Adding Data

What is GCP?

GCP, or Google Cloud Platform, is a public cloud provider.

It’s similar to other offerings you may have heard of, like AWS or Azure.

Cloud providers are increasingly important in medicine!

Already a GCP User?

If you already have a GCP account that you’ll be using for this workshop, you have two options:

Option A: go get a cup of coffee and we’ll see you back here in about 15 minutes. You might end up spending money on our activities today!

Option B (recommended): stick around to create a new account with a brand new “welcome to GCP” free trial worth $300 in GCP services

Google Identity

If you need to create a new Google identity, please go to https://accounts.google.com now and create a new account. Even if you already have one, this is a way to guarantee you’ll be working in the free tier with some Google credits!

Setting Up GCP

Have your Google identity? Now you can go to https://console.cloud.google.com to sign up for GCP.

You have to do two things:

  1. Agree to terms and
  2. Activate your free trial
    • Agree to THOSE terms
    • Set up a new “payments profile” and add a credit card number

Agreeing to terms

First, agree to terms (the easy part).

Check the box and click “Agree…”

Free Trial

Now, start the free trial: a bit more complex.

Google Cloud Platform Toolbar

Gemini: Generative AI

Let’s experiment. Click on the Gemini “star” and ask a question about BigQuery or GCP. I thought it might be interesting to ask about BigQuery and medicine.

Gen AI Caveats

GCP Data Solutions: BigQuery

  • BigQuery is not just “giant SQL”: it’s for warehousing
  • It is not intended for production transactional work:
    • No enforced foreign keys
    • Column-based storage means inefficient lookups of single rows
    • Transactional consistency not guaranteed
  • BigQuery has its own SQL dialect (mostly what you’re used to)
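
To make the column-storage point concrete, here is a toy sketch in Python. It is only an illustration of the general idea and not how BigQuery is actually implemented; the table contents and function names are made up for this example.

```python
# Toy illustration only: made-up data, and NOT BigQuery's real storage engine.

# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "state": "PA", "births": 120},
    {"id": 2, "state": "NJ", "births": 95},
    {"id": 3, "state": "DE", "births": 40},
]

# Column-oriented layout: each column is stored together.
columns = {
    "id": [1, 2, 3],
    "state": ["PA", "NJ", "DE"],
    "births": [120, 95, 40],
}


def total_births_columnar(cols):
    """Analytic query: only ONE column needs to be read."""
    return sum(cols["births"])


def fetch_row_columnar(cols, row_id):
    """Single-row lookup: EVERY column must be visited to rebuild the row."""
    position = cols["id"].index(row_id)
    return {name: values[position] for name, values in cols.items()}


def fetch_row_row_oriented(table, row_id):
    """Single-row lookup in a row store: the record is already in one piece."""
    return next(record for record in table if record["id"] == row_id)
```

Summing one column touches a single list in the columnar layout, which is the warehouse-style workload. Rebuilding one row means visiting every column, which is why single-row lookups are a poor fit.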

Getting Started in GCP

Google Cloud Platform (GCP) organizes resources by project.

Optionally, you can also define an organization and group projects by folder.

Get Started With BigQuery!

To get started with BigQuery, you will:

  • Create a new project
    • Project name – mutable, you assign
    • Project ID – immutable, you assign
    • Project number – immutable, Google assigns
  • Add BigQuery as a resource
  • Use datasets you create and/or public datasets you can access

Optionally, add other resources:

  • Google Cloud Storage to hold large files for ingestion by BigQuery
  • Containers or VMs with analytic software like R / Stata
  • Service accounts to interface with BigQuery

Exercise

  • Create a new project with a name and ID that work for you
  • Open BigQuery
  • Preview public datasets

Create A New Project

  • In the project selector, click the downward-facing triangle
  • Select “New Project”
  • Add a project name, edit the project ID (optional)
    • Project ID must be globally unique!
  • Leave “Location” as “No Organization”
  • Click “Create”
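
If you want to sanity-check a project ID before typing it in, here is a minimal Python sketch. It assumes the format rules as I understand them from Google's documentation: 6 to 30 characters, lowercase letters, digits, and hyphens only, starting with a letter and not ending with a hyphen. The function name is hypothetical. It cannot check global uniqueness or any restricted words; only the console can tell you that.

```python
import re

# Assumed format rules (check the GCP docs if in doubt):
#   - 6 to 30 characters in total
#   - starts with a lowercase letter
#   - contains only lowercase letters, digits, and hyphens
#   - does not end with a hyphen
PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def looks_like_valid_project_id(candidate: str) -> bool:
    """Return True if the candidate matches the assumed format rules.

    This is a format check only. It says nothing about whether the ID
    is already taken by someone else.
    """
    return PROJECT_ID_PATTERN.fullmatch(candidate) is not None
```

For example, `looks_like_valid_project_id("rmedicine-bq-2024")` passes the format check, while `"My Project"` fails because of the capital letters and the space.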

Open BigQuery

  • Once you’re in your project, click the “burger” menu (☰).
  • BigQuery is probably already pinned (click on it). If you don’t see BigQuery:
    • Choose “View all Products”
    • In “Analytics”, click on BigQuery (you may also want to “pin” this to the top of your menu).
  • Enable the BigQuery API to add BigQuery to your project

Preview Public Datasets

  • In Explorer, click on “+ Add Data”
  • Look for the “Additional Sources” option
  • Choose “Public Datasets”

Find Public Databases of Interest

Look for Area Deprivation Index (ADI) by searching for it or looking in the healthcare category.

Add a Public Dataset to Your Project

In the “View Dataset” screen, the specific dataset will be highlighted in the left panel, which shows your current data. Please click on the star icon.

Do the same thing for CDC Natality Database

Look for the CDC Births data by searching for it or looking in the healthcare category. View it and “star” it!

Viewing “Starred” Data

Since there are so many public datasets, Google no longer displays those by default in BigQuery, even though you have access to them. That’s why we “starred” them.

Now we can choose to show “only starred data”.

Structures of Data

  • Projects contain datasets
  • Datasets contain tables (point-in-time, established data) and views (live filters that show current data)
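
This hierarchy shows up directly in how tables are named in SQL: a table or view is addressed as project.dataset.table, wrapped in backticks. Here is a small Python sketch of that naming. The helper functions and the placeholder names (`my-project-id`, `my_dataset`, `my_table`) are hypothetical and only for illustration.

```python
def table_reference(project_id: str, dataset: str, table: str) -> str:
    """Build a backtick-quoted, fully qualified table (or view) name."""
    return f"`{project_id}.{dataset}.{table}`"


def preview_query(project_id, dataset, table, column_names, limit=10):
    """Build a small SELECT statement against a fully qualified table.

    Naming specific columns (instead of SELECT *) suits column-based
    storage, because only the listed columns need to be read.
    """
    column_list = ", ".join(column_names)
    reference = table_reference(project_id, dataset, table)
    return f"SELECT {column_list} FROM {reference} LIMIT {limit}"


# Placeholder names, not real resources:
example_sql = preview_query(
    "my-project-id", "my_dataset", "my_table", ["col_a", "col_b"], limit=5
)
```

The same three-part name works for public data too; only the project part changes to the project that hosts the public datasets.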

Break

Let’s take a break. So far you:

  • Created a GCP account
  • Created a new project
  • Added BigQuery as a resource to your project
  • Identified (starred) public data of interest

During break, you can either relax, or, if you want:

  • Experiment with interrogating Gemini
  • Look around at other public datasets
  • Try some of the other “Add Data” methods to add your own (non-regulated, not-PHI, etc.) data to your project